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Abstract 

We present a novel approach to learn a kernel- 
based regression function. It is based on the use 
of conical combinations of data-based parameter- 
ized kernels and on a new stochastic convex op- 
timization procedure of which we establish con- 
vergence guarantees. The overall learning pro- 
cedure has the nice properties that a) the learned 
conical combination is automatically designed to 
perform the regression task at hand and b) the 
updates implicated by the optimization proce- 
dure are quite inexpensive. In order to shed light 
on the appositeness of our learning strategy, we 
present empirical results from experiments con- 
ducted on various benchmark datasets. 



1. Introduction 

Our goal is to learn a kernel-based regression function, 
tackling at once two problems that commonly arise with 
kernel methods: working with a kernel tailored to the 
task at hand and efficiently handling problems whose size 
prevents the Gram matrix from being stored in memory. 
Though the present work focuses on regression, the mate- 
rial presented here might as well apply to classification. 

Compared with similar methods, we introduce two nov- 
elties. Firstly, we build conical combinations of rank-1 
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Nystrom approximations, whose weights are chosen so as 
to serve the regression task - this makes our approach dif- 
ferent from (Kumar et al., 2009) and (Suykens et al., 2002), 
which focus on approximating the full Gram matrix with no 
concern for any specific learning task. Secondly, to solve 
the convex optimization problem entailed by our model- 
ing choice, we provide an original stochastic optimization 
procedure based on (Nesterov, 2010). It has the follow- 
ing characteristics: i) the computations of the updates are 
inexpensive (thanks to the designing choice of using rank- 
1 approximations) and ii) the convergence is guaranteed. 
To assess the practicality and effectiveness of our learning 
procedure, we conduct a few experiments on benchmark 
datasets, which allow us to draw positive conclusions on 
the relevance of our approach. 

The paper is organized as follows. Section 2 introduces 
some notation and our learning setting; in particular the 
optimization problem we are interested in and the rank- 
1 parametrization of the kernel our approach builds upon. 
Section 3 describes our new stochastic optimization proce- 
dure, establishes guarantees of convergence and details the 
computations to be implemented. Section 4 discusses the 
hyperparameters inherent to our modeling as well as the 
complexity of the proposed algorithm. Section 5 reports 
results from numerical simulations on benchmark datasets. 

2. Proposed Model 

Notation X is the input space, k : X x X — > M denotes the 
(positive) kernel function we have at hand and : X — >• H 
refers to the mapping 4>(x) := k(x, •) from X to the repro- 
ducing kernel Hilbert space H associated with k. Hence, 



Stochastic Low-Rank Kernel Learning for Regression 



k(x, x')=((f>(x), 4>(x')), with (•, •) the inner product of H. 2.2. Kernel Ridge Regression 



The training set is L := {(x h y l )}? =1 G (X X R) n , 
where yi is the target value associated to a^. K = 
(k(xi,Xj))i<i ! j< n € M. nxn is the Gram matrix of k with 
respect to C. For m — 1, . . . , n, c m <G M. n is defined as: 

c m . \k(x \ , x m ) , . . . , k{x n , x m )] 

2.1. Data-parameterized Kernels 

For m = 1, . . . , n, (j) m '■ X — > ?i m is the mapping: 



(1) 



It directly follows that fc m defined as, Vx, x' € A", 

k(x : x nl ^k(x , 3^771 ) 

is indeed a positive kernel. Therefore, these parameterized 
kernels fc m give rise to a family (-R' m )i< m <ri of Gram ma- 
trices of the following form: 



k m {x,x') := (^ m (a;),0 m (x')) = 



T 
m ■ 



(2) 



which can be seen as rank- 1 Nystrom approximations of the 
full Gram matrix K (Drineas & Mahoney, 2005; Williams 
& Seeger, 2001). 

As studied in (Kumar et al., 2009), it is sensible to con- 
sider convex combinations of the K m if they are of very 
low rank. Building on this idea, we will investigate the use 
of a parameterized Gram matrix of the form: 



K m with fi m > 0, 



(3) 



where S is a set of indices corresponding to the specific 
rank-one approximations used. Note that since we con- 
sider conical combinations of the K m , which are all posi- 
tive semi-definite, K(fi) is positive semi-definite as well. 

Using (1), one can show that the kernel k^, associated to 
our parametrized Gram matrix K(fi), is such that: 

k^x, x 1 ) = Ms), 4>(x')) A = (f>(x) T A0(x), (4) 



with A : = 



m£5 



(p(x m )(j>(x m ) T 



(5) 



In other words, our parametrization induces a modified 
metric in the feature space T~L associated to k. On a side 
note, remark that when S = {1 . . . , n} (i.e. all the columns 
are picked) and we have uniform weights then K(fj.) = 
KK T , which is a matrix encountered when working with 
the so-called empirical kernel map (Scholkopf et al., 1999). 

From now on, M denotes the size of S and m refers to 
the number of non-zero components of /j, (i.e. it is the 0- 
pseudo-norm of fi). 



Kernel Ridge regression (KRR) is the kernelized version 
of the popular ridge regression (Hoerl & Kennard, 1970) 
method. The associated optimization problem reads: 

min |a|H| 2 + £ {Vi ~ («, ^))f j , (6) 

where A > is a regularization parameter. 

Using / for the identity matrix, the following dual formu- 
lation may be considered: 

max { F KRR (a) := y T a - ^-a T (\I + K)a \ . (7) 

agl" ^ 4A J 

The solution a* of the concave problem (7) and the optimal 
solution w* of (6) are connected through the equality 



1 " 



2A 

and a* can be found by setting the gradient of Fkrr to 
zero, to give 

a* =2(I+{K)- 1 y. (8) 
The value of the objective function at a* is then: 

F KRR (a*)=y T (I+{K)- 1 y, (9) 
and the resulting regression function is given by: 



1 " 



(10) 



2.3. A Convex Optimization Problem 

KRR may be solved by solving the linear system (/ + 
— 2y, at a cost of 0(n 3 ) operations. This might 
be prohibitive for large n, even more so if the matrix I+^f 
does not fit into memory. To cope with this possible prob- 
lem, we work with K(fi) (3) instead of the Gram matrix 
K. As we shall see, this not only makes it possible to avoid 
memory issues but it also allows us to set up a learning 
problem where both /j, and a regression function are sought 
for at once. This is very similar to the Multiple Kernel 
Learning paradigm (Rakotomamonjy et al., 2008) where 
one learns an optimal kernel along with the target function. 

To set up the optimization problem we are interested in, we 
proceed in a way similar to (Rakotomamonjy et al., 2008). 
For m = 1, . . . , n, define the Hilbert space H' m as: 



f G H r , 



H/lk 



< oo 



(11) 
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mm 



One can prove (Aronszajn, 1950) that H = ©"H' m is the 
RKHS associated to k = iJL m k m - Mimicking the rea- 
soning of (Rakotomamonjy et al., 2008), our primal opti- 
mization problem reads: 

in S X J2 —\\fm\\% +i> i -£/™(* i ))4> 
s.t. ^2 Mm < ni , /i m > 0, (12) 

where ni is a parameter controlling the 1-norm of //. As 
this problem is also convex in /x, using the earlier results 
on the KRR problem, (12) is equivalent to: 

min {max y T a -a T (XI + K{n))a 

m>o a 4A 



mm 



in {y T (7 + ^(/x))- 1 ^} s.t. ^^ m < m. (13) 



mes 



Finally, using the equivalence between Tikhonov and 
Ivanov regularization methods (Vasin, 1970), we obtain the 
convex and smooth optimization problem we focus on: 

mmji^) :=y T {I+\K{n))- 1 y + vY,Vm\- (14) 



The regression function / is derived using (1), a minimizer 
fi* of the latter problem and the accompanying weight vec- 
tor a* such that 



a* = 2(J+AX( M *)J y, (15) 
(obtained adapting (8) to the case K = K(fi*)). We have: 



,T H a* m k(x m ,x) 



m£5 



2A 



m£5 



where a* m := ^ m 



(16) 



(17) 



3. Solving the problem 

We now introduce a new stochastic optimization procedure 
to solve (14). It implements a coordinate descent strategy 
with step sizes that use second-order information. 

3.1. A Second-Order Stochastic Coordinate Descent 

Problem (14) is a constrained minimization based on the 
differentiable and convex objective function F. Usual con- 
vex optimization methods (such as projected gradient de- 
scent, proximal methods) may be employed to solve this 



problem, but they may be too computationally expensive 
if n is very large, which is essentially due to a suboptimal 
exploitation of the parametrization of the problem. Instead, 
the optimization strategy we propose is specifically tailored 
to take advantage of the parametrization of K(fj,). 

Algorithm 1 depicts our stochastic descent method, in- 
spired by (Nesterov, 2010). At each iteration, a randomly 
chosen coordinate of fi is updated via a Newton step. This 
method has two essential features: i) using coordinate-wise 
updates of fi involves only partial derivatives which can be 
easily computed and ii) the stochastic approach ensures a 
reduced memory cost while still guaranteeing convergence. 

Algorithm 1 Stochastic Coordinate Newton Descent 

Input: n° random, 
repeat 

Choose coordinate m& uniformly at random in S. 
Update : /i^ 1 = [i k n if m ^ mk and 

Mm* -argmin g \v M™J+ 2 a M 2 ^ ' 

•U>0 k 

(18) 

until F{fi k ) - F(n k - M ) < eF(fi k - M ) 

Notice that the Stochastic Coordinate Newton Descent 
(SCND) is similar to the algorithm proposed in (Nesterov, 
2010), except that we replace the Lipschitz constants by the 
second-order partial derivatives d ^ ' . Thus, we replace 

a constant step-size gradient descent by a the Newton-step 
in (18), which allows us to make larger steps. 

We show that for the function F in (14), SCND does prov- 
ably converge to a minimizer of Problem (14). First, we 
rewrite (18) as a Newton step and compute the partial 
derivatives: 



Proposition 1. Eq. (18) is equivalent to 



u k+1 



otherwise. 



1 9<, 



if ■ 



^0 



(19) 



Proof. (19) gives the optimality conditions for (18). □ 



Proposition 2. The partial derivatives 9 g^i^ are: 



(20) 



™ = (-l)^!A(y T ^c m ) 2 (c^X A -i t c m r 1 , 

with p>2 andK^ := (XI + _1 . (21) 

Proof. Easy but tedious calculations give the results. □ 

Theorem 1 (Convergence). For any sequence {mk}k, the 
sequence {F(fi k )}k verifies: 
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(a) Vfe, F(n k+1 ) < F{p, k ). 

(b) lim fc _ ! . 0o F(fi k ) = min M > F((t). 

Moreover, if there exists a minimize r fi* of F such that the 
Hessian \7 2 F(p,*) is positive definite then: 



(c) fi* is the unique minimizer of F. The sequence 
converges to fi* : 1 1 fi k — /x* 1 1 — >0. 



Sketch of proof (a) Using that ^ft* 1 < (see (20)), one 
shows that the Taylor series truncated to the second or- 
der: ^%) + ^ (v m - M J + \ *™ (v m - 



and G id, the vectors and matrices before the update and 
Mnew Cnew, Miew and G new the updated matrices/vectors. 
Let also e p bethe vector whose pth coordinate is 1 while 
other coordinates are 0. We encounter four different cases. 

Case 1: [i p lA = and fi p ew = 0. No update needed: 

Gnew = Gold- (24) 

Case 2: fif d ^ and ^" ew ^ 0. Here, G i d = G new and 



-^new — ^old 



T 



A r) Cf> 6 



where A 



P * , ,new //old ' 



, is a quadratic upper-bound of F that matches F Then ' usin 8 Woodbury formula, we have: 



and V.F at point (for any fixed to and /x). From this, 
the update formula (18) yields < F(n k ). 

(b) First note that \\fj, || < F(n°) and extract a con- 
verging subsequence {/i.^^^}. Denote the limit by 



is zero or not, 



fi. Separating the cases where 
one shows that (i satisfies the optimality conditions: 
(VF(A),» - A) > 0, > 0. Thus /I is a mini- 
mizer of F and we have lim F(fJ, k ) = lim F(/x^ fc ) ) = 
= min^> F(fi). 

(c) is standard in convex optimization. □ 
3.2. Iterative Updates 

One may notice that the computations of the deriva- 
tives (20), as well as the computation of a*, depend on 
. Moreover, the dependency in fi, for all those quan- 
tities, only lies in . Thus, a special care need be taken 
on how is stored and updated throughout. 



Let <S+ = {to € <S|p m > 0} and m = ||/x|| 



\S+\. Let 

G = [cjj • • • Ci mQ ] be the concatenation of the c^.'s, for 
ij G <S+ and D the diagonal matrix with diagonal elements 
Hi j , for ij € 5+ . Remark that throughout the iterations the 
sizes of G and D may vary. Given (21) and using Wood- 
bury formula (Theorem 2, Appendix), we have: 



1 



/ - — GGG T (22) 



A 2 



with 



G:=[D 



A 



(23) 



Note that G is a square matrix of order too and that an 
update on /j, will require an update on G. Even though 



updating G 



i.e. D~ 



4-G T G, is trivial, it is more 



efficient to directly store and update G. This is what we 
describe now. 

At each iteration, only one coordinate of fi is updated. Let 
p be the index of the updated coordinate, /x old , G Q id, -Doid 



G n 



(^old 



T 



G 



old 



1 -|- /\ p g pp 



g p gj, 

(25) 



with g pp the (p, p)th entry of G id and g p its pth column. 

Case 3: ^ and ^" ew = 0. Here, = 

5+ \ {p}. It follows that we have to remove c p from 
Gold to have G new - To get G new , we may consider the 
previous update formula when /^™ w — > (that is, when 
A p — >• +oo). Note that we can use the previous formula 
because fi p M> is well-defined and continuous at 0. 

Thus, as lirrunew^o i z a p — = ~> we have: 

Cnew 



G, 



old 



ffpp / \ {p} 



(26) 



where denotes the matrix A from which the pth col- 

umn and pth row have been removed. 

Case 4: p p ld = and /^ ew ^ 0. We have G ne „ = 
[Gold c p ] . Using (23), it follows that 



G n 



D nM + t-GXiG, 



old _ "T" J,°old L/ old 
old 



A °old L P 

A P C P 



\ A C P ^ old 



A v 

, v T s 



A old L P 
Ai»™ T A C P C P 



-1 



where, using the block-matrix inversion formula of Theo- 
rem 3 (Appendix), we have: 



,.U( 



1 

new 



^ Cp Cp ^2 P old^old^old P 



V — Gold G id Cp 

-4 = G i d + 1 vv T . 

s 



(27) 
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Algorithm 2 SLKL: Stochastic Low-Rank Kernel Learning 

inputs: C := {(x^yi)}^, v > 0, M > 0, e > 0. 
outputs: £i, G and C (yield (XI + Kip))- 1 from (22)). 

initialization: /x^ ^ = 0. 
repeat 

Choose coordinate m& uniformly at random in 5. 
Update fjX k > according to (19), by changing only the 
m.fe-th coordinate fj, k of fi^ : 

• compute the second order derivative 

h = \(y T K^c^fic^K^c,^) ; 

• if h > then 

M^ 1 ' = max f 0, + ^ ' A >^ mfc - ) j ; 

else fJ-mt = 0- 

Update £?(*) and according to (24)-(27). 
until F(n k ) - F{v k - M ) < eF((j, k ~ M ) 

Complete learning algorithm. Algorithm 2 depicts 
the full Stochastic Low-Rank Kernel Learning algorithm 
(SLKL), which recollects all the pieces just described. 

4. Analysis 

Here, we discuss the relation between A and v and we argue 
that there is no need to keep both hyperparameters. In addi- 
tion, we provide a short analysis on the runtime complexity 
of our learning procedure. 

4.1. Pivotal Hyperparameter Xv 

First recall that we are interested in the minimizer fi* x v of 
constrained optimization problem (14), i.e.: 

V*\ v = argmin F Aj „ (/x) , (28) 

fM>0 

where, for the sake of clarity, we purposely show the de- 
pendence on A and v of the objective function F\ jV 

F x ^)=y T (i + K^y 1 y + XvY^^i (29) 

m 

We may name a* x , a* x v the weight vectors associated 
with fi* x v (see (15) and (17)). We have the following: 
Proposition 3. Let A, v, A', v' be strictly positive real num- 
bers. If Xv = XI v' then 

Ma>' = TV\,u> and h,v = h',w- 
As a direct consequence: 

VA, v > 0, fx, v = fi,Xv 



Proof. Suppose that we know [i* x . Given the defini- 
tion (29) of F\.v and using Xv = X'v' , we have 

Since the only constraint of problem (28) is the nonneg- 
ativity of the components of fi, it directly follows that 
X'fj,* x V /X is a minimizer of Fyy (under these constraints), 
hence n* x , y = X'^* X v /X. 

To show fx, v = f\',u>, it suffices to observe that, according 
to the way a* x v is defined (cf. (15)), 

a *x, y =2(l + K(^)y 1 y 

and, thus, a* x , v , — X'a* x V /X. The definition (16) of fx, v 
then gives f x , v = fx',v, which entails f x , v = h,Xv n 

This proposition has two nice consequences. First, it says 
that the pivotal hyperparameter is actually the product Xv: 
this is the quantity that parametrizes the learning problem 
(not A or v, seen independently). Thus, the set of regression 
functions, defined by the A and v hyperparameter space, 
can be described by exploring the set of vectors (/j,* v )v>a, 
which only depends on a single parameter. Second, con- 
sidering (fi* „)„>o allows us to work with the family of 
objective functions [Fi v ) v>Q , which are well-conditioned 
numerically as the hyperparameter A is set to 1. 

4.2. Runtime Complexity and Memory Usage 

For the present analysis, let us assume that we pre-compute 
the M (randomly) selected columns c 1; . . . , cjf. If a is the 
cost of computing a column c m , the pre-computation has a 
cost of O(AIa) and has a memory usage of O(nM). 

At each iteration, we have to compute the first and second- 
order derivatives of the objective function, as well as its 
value and the weight vector a. Using (22), (20), (14) and 
(15), one can show that those operations have a complexity 
of O(nmo) if mo is the zero-norm of fi. 

Besides, in addition to C, we need to store G for a mem- 
ory cost of O(mg). Overall, if we denote the number 
of iterations by k, the algorithm has a memory cost of 
0(nM + ttiq) and a complexity of O(knrrio + Ma). 

If memory is a critical issue, one may prefer to compute the 
columns c m on-the-fly and tuq columns need to be stored 
instead of M (this might be a substantial saving in terms 
of memory as can be seen in the next section). This im- 
provement in term of memory usage implies an additive 
cost in the runtime complexity. In the worst case, we have 
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to compute a new column c at each iteration. The result- 
ing memory requirement scales as O(nmo + TOq) and the 
runtime complexity varies as O(k(nmo + a)). 

5. Numerical Simulations 

We now present results from various numerical experi- 
ments, for which we describe the datasets and the protocol 
used. We study the influence of the different parameters of 
our learning approach on the results and compare the per- 
formance of our algorithm to that of related methods. 

5.1. Setup 

First, we use a toy dataset (denoted by ji'hc) to better un- 
derstand the role and influence of the parameters. It con- 
sists in regressing the cardinal sine of the two-norm (i.e. 
x i y sin(||cc||)/||cc||) of random two-dimensional points, 
each drawn uniformly between —5 and +5. In order to have 
a better idea on how the solutions may or may not over-fit 
the training data, we add some white Gaussian noise on the 
target variable of the randomly picked 1000 training points 
(with a 10 dB signal-to-noise ratio). The test set is made of 
1000 non-noisy independent instance/target pairs. 

We then assess our method on two UCI datasets: Abalone 
(abalone) and Boston Housing (boston), using the same 
normalizations, Gaussian kernel parameters (a denotes the 
kernel width) and data partition as in (Smola & Scholkopf, 

2000) . The United States Postal Service (USPS) dataset 
is used with the same setting as in (Williams & Seeger, 

2001) . Finally, the Modified National Institute of Standards 
and Technology (MNIST) dataset is used with the same pre- 
processing as in (Maji & Malik, 2009). Table 1 summarizes 
the characteristics of all the datasets we used. 



Table 1. Datasets used for the experiments. 



v=0.01 M=1000 



dataset 


#features 


#train (n) 


#test 


<r 2 


sine 


2 


1000 


1000 


1 


abalone 


10 


3000 


1177 


2.5 


boston 


13 


350 


156 


3.25 


USPS 


256 


7291 


2007 


64 


MNIST 


2172 


60000 


10000 


4 



As displayed in Algorithm 1, at each iteration k > M, we 
checkifF(/x' £ )-F(/x fc - M ) < eF(/x fc ~ M ) holds. If so, we 
stop the optimization process, e thus controls our stopping 
criterion. In the experiments, we set e = 10~ 4 unless oth- 
erwise stated and we set A to 1 for all the experiments and 
we run simulations for various values of v and M. In order 
to assess the variability incurred by the stochastic nature of 
our learning algorithm, we run each experiment 20 times. 




2 4 6 

log(iteration) 



Figure 1. Evolution of the objective during the optimization pro- 
cess for the sine dataset with v = 0.01, M = 1000 (for 20 runs). 




Figure 2. Zero-norm of the optimal /x* as a function of M for 
different values of v for the sine dataset (averaged on 20 runs). 



5.2. Influence of the parameters 

5.2.1. Evolution of the objective 

We have established (Section 3) the convergence of our op- 
timization procedure, under mild conditions. A question 
that we have not tackled yet is to evaluate its convergence 
rate. Figure 1 plots the evolution of the objective function 
on the sine dataset. We observe that the evolutions of the 
objective function are impressively similar among the dif- 
ferent runs. This empirically tends to assert that it is rele- 
vant to look for theoretical results on the convergence rate. 

A question left for future work is the impact of the random 
selection of the set of columns S on the reached solution. 

5.2.2. Zero-norm of /x 

As shown in Section 4.2, both memory usage and the com- 
plexity of the algorithm depend on mo. Thus, it is inter- 
esting to take a closer look at how this quantity evolves. 
Figure 2 and 3 experimentally point out two things. On the 
one hand, the number of active components mo = 1 1 fj, | | o re- 
mains significantly smaller than M, In other words, as long 
as the regularization parameter is well-chosen, we never 
have to store all of the c m at the same time. On the other 
hand, the solution /x* is sparse and ||/x*||o grows with M 
and diminishes with v. A theoretical study on the depen- 
dence of //* and mo in M and v, left for future work, would 
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v=0.01 M=1000 




Table 3. Number of errors and standard deviation on the test set 
(2007 examples) of the USPS dataset. 



M 


64 


256 


1024 


Nyst 


101.3 ± 22.9 


34.5 ± 3.0 


35.9 ±2.0 


SLKL 


76.3 ±9.9 


47.6 ±3.1 


41.5 ±3.9 


m 


61 


210 


515 



1000 2000 3000 4000 5000 6000 

iteration 

Figure 3. Evolution of the zero-norm of fi (mo) with the iterations 
for the sine dataset with v = 0.01, M = 1000 (20 runs). 



be all the more interesting since sparsity is the cornerstone 
on which the scalability of our algorithm depends. 

5.3. Comparison to other methods 

This section aims at giving a hint on how our method per- 
forms on regression tasks. To do so, we compare the Mean 
Square Error (over the test set). In addition to our Stochas- 
tic Low-Rank Kernel Learning method (SLKL), we solve 
the problem with the standard Kernel Ridge Regression 
method, using the n training data (KRRn) and using only M 
training data (KRRM). We also evaluate the performance 
of the KRR method, using the kernel obtained with uni- 
form weights on the M rank-1 approximations selected for 
SLKL (Unif). The results are displayed in Table 2, where 
the bold font indicates the best low-rank method (KRRM, 
Unif or SLKL) for each experiment. 

Table 2 confirms that optimizing the weight vector fi is de- 
cisive as our results dramatically outperform those of Unif. 
As long as M < n, our method also outperforms KRRM. 
The explanation probably lies in the fact that our approx- 
imations keep information about similarities between the 
M selected points and the n — M others. Furthermore, 
our method SLKL achieves comparable performances (or 
even better on abalone) than KRRn, while finding sparse 
solutions. Compared to the approach from (Smola & 
Scholkopf, 2000), we seem to achieve lower test error on 
the boston dataset even for M = 128. On the abalone 
dataset, this method outperforms ours for every M we tried. 

Finally, we also compare the results we obtain on the USPS 
dataset with the ones obtained in (Williams & Seeger, 
2001) (Nyst). As it consists in a classification task, we 
actually perform a regression on the labels to adapt our 
method, which is known to be equivalent to solving Fisher 
Discriminant Analysis (Duda & Hart, 1973). The perfor- 
mance achieved by Nyst outperforms ours. However, one 
may argue that the performance have a same order of mag- 
nitude and note that the Nyst approach focuses on the clas- 
sification task, while ours was designed for regression. 



5.4. Large-scale dataset 

To assess the scalability of our method, we ran experiments 
on the larger handwritten digits MNIST dataset, whose 
training set is made of 60000 examples. We used a Gaus- 
sian kernel computed over histograms of oriented gradients 
as in (Maji & Malik, 2009), in a "one versus all" setting. 
For M = 1000, we obtained classification error rates around 
2% over the test set, which do not compete with state-of- 
the-art results but achieve reasonable performance, consid- 
ering that we use only a small part of the data (cf. the size 
of M) and that our method was designed for regression. 

Although our method overcomes memory usage issues for 
such large-scale problems, it still is computationally inten- 
sive. In fact, a large number of iterations is spent picking 
coordinates whose associated weight remains at 0. Though 
those iterations do not induce any update, they do require 
computing the associated Gram matrix column (which is 
not stored as it does not weigh in the conic combination) as 
well as the derivatives of the objective function. The main 
focus of our future work is to avoid those computations, us- 
ing e.g. techniques such as shrinkage (Hsieh et al., 2008). 

6. Conclusion 

We have presented an original kernel-based learning pro- 
cedure for regression. The main features of our contri- 
bution are the use of a conical combination of data-based 
kernels and the derivation of a stochastic convex optimiza- 
tion procedure, that acts coordinate-wise and makes use of 
second-order information. We provide theoretical conver- 
gence guarantees for this optimization procedure, we de- 
pict the behavior of our learning procedure and illustrate its 
effectiveness through a number of numerical experiments 
carried out on several benchmark datasets. 

The present work naturally raises several questions. 
Among them, we may pinpoint that of being able to estab- 
lish precise rate of convergence for the stochastic optimiza- 
tion procedure and that of generalizing our approach to the 
use of several kernels. Establishing data-dependent gener- 
alization bounds taking advantage of either the one-norm 
constraint on fi or the size M of the kernel combination is 
of primary importance to us. The connection established 
between the one-norm hyperparameter v and the ridge pa- 
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Table 2. Mean square error with standard deviation measured on three regression tasks. 



M 


256 


sine 
512 


1000 


128 


boston 
256 


350 


512 


abalone 
1024 


3000 


KRRn 




0.009 ± 09 






10.17 ±0 






6.91 ±0 




KRRM 


0.0146 

±le~ 3 


0.0124 
±7e" 4 


0.0099 

±0 


33.27 
±7.8 


16.89 
±3.27 


10.17 

±0 


6.14 
±0.25 


5.51 
±0.09 


5.25 
±0 


Unif 


0.0124 

±le- 4 


0.0124 

±3e -5 


0.0124 
±0 


149.7 
±5.57 


147.84 
±2.24 


147.72 
±0 


10.04 
±0.17 


9.96 
±0.06 


9.99 
±0 


SLKL 

m 


0.0106 

±4e" 4 
83 


0.0103 

±2e~ 4 
108 


0.0104 

±le" 4 
139 


20.17 

±2.3 
108 


13.1 

±0.87 
161 


11.43 
±0.06 
184 


5.04 

±0.08 
159 


4.94 

±0.03 
191 


4.95 

±0.004 
253 



rameter A, in section 4, seems interesting and may be wit- 
nessed in (Rakotomamonjy et al., 2008). Although not 
been mentioned so far, there might be connections between 
our modeling strategy and boosting/leveraging-based opti- 
mization procedures. Finally, we plan on generalizing our 
approach to other kernel methods, noting that rank-1 up- 
date formulas as those proposed here can possibly be ex- 
hibited even for problems with no closed-form solution. 
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A. Matrix Inversion Formulas 

Theorem 2. (Woodbury matrix inversion formula (Wood- 
bury, 1950)) Let n and m be positive integers, A £ R nxn 
and C G R mXra be non-singular matrices and let U £ 
E nxm and V &R mxn be two matrices. IfC^ + VA^U 
is non-singular then so is A+UCV and: 

(A+UCV)- 1 = A~ 1 -A~ 1 U(C~ 1 +VA~ 1 U)~ 1 VA~ 1 . 

Theorem 3. (Matrix inversion with added column) Given 
m, integer and M £ K(™+ 1 ) x ("+ 1 ) partitioned as: 

M ={y b \ where A g R nx ", b e R n and cER. 

If A is non-singular and c — b T A~ x b 7^ 0, then M is non- 
singular and the inverse of M is given by 

M-'^-'tf^"' "if*). (30) 
where k = c — b T A _1 b. 
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